1. Import and understand the data. [5 Marks]

A. Import ‘signal-data.csv’ as DataFrame. [2 Marks]
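A minimal sketch of the import. Since the CSV is not bundled here, a small in-memory stand-in (with hypothetical column names) keeps the example runnable; in the notebook this would simply be `pd.read_csv('signal-data.csv')`.

```python
import io
import pandas as pd

# In the notebook: df = pd.read_csv('signal-data.csv')
# A tiny in-memory CSV with hypothetical columns stands in here.
csv_text = "Time,feature_1,feature_2,Pass/Fail\n"
csv_text += "2008-07-19 11:55:00,3030.93,2564.00,-1\n"
csv_text += "2008-07-19 12:32:00,3095.78,2465.14,1\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (2, 4)
```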

B. Print 5 point summary and share at least 2 observations. [3 Marks]

Insights:

Even with NaN values present, we can still get the 5 point summary using describe(), since it ignores NaNs by default.
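For instance, on a toy series with a missing entry:

```python
import numpy as np
import pandas as pd

# describe() skips NaN by default, so the five-number summary
# is still available for a feature with missing entries.
s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
summary = s.describe()
print(summary[['min', '25%', '50%', '75%', 'max']])
```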

Insights:

2. Data cleansing: [15 Marks]

A. Write a for loop which will remove all the features with 20%+ Null values and impute rest with mean of the feature. [5 Marks]
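One possible sketch of that loop on toy data, reading "20%+" as strictly more than 20% null (use `>=` instead if the rubric means at least 20%):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical features): 'a' is 60% null, 'b' is 20% null.
df = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, np.nan, 5.0],
    'b': [1.0, 2.0, np.nan, 4.0, 5.0],
    'c': [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Drop features with more than 20% nulls; mean-impute the rest.
for col in list(df.columns):
    if df[col].isnull().mean() > 0.20:
        df.drop(columns=col, inplace=True)
    else:
        df[col] = df[col].fillna(df[col].mean())

print(df.isnull().sum().sum())  # 0
```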

B. Identify and drop the features which are having same value for all the rows. [3 Marks]
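A sketch using `nunique()` on toy data: a feature whose number of unique values is 1 carries no information for the model.

```python
import pandas as pd

df = pd.DataFrame({'const': [7, 7, 7], 'var': [1, 2, 3]})

# Columns where every row holds the same value.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) == 1]
df = df.drop(columns=constant_cols)
print(list(df.columns))  # ['var']
```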

C. Drop other features if required using relevant functional knowledge. Clearly justify the same. [2 Marks]

Since the Time column does not contribute to the target, we can drop it. We are not doing time-series forecasting in this use case; we are solving a classification problem here, so the Time column can be dropped.

Let us find out whether any columns are exact duplicates of one another and remove them.
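One way to sketch this: transpose the frame so that `duplicated()` compares whole columns as rows.

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [1, 2, 3], 'z': [4, 5, 6]})

# duplicated() on the transpose flags every column that exactly
# repeats an earlier one ('y' duplicates 'x' here).
dup_cols = df.columns[df.T.duplicated()]
df = df.drop(columns=dup_cols)
print(list(df.columns))  # ['x', 'z']
```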

D. Check for multi-collinearity in the data and take necessary action. [3 Marks]

Insights:

Removing columns with high collinearity
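A common sketch for this: compute the absolute correlation matrix, keep only its upper triangle, and drop one column of each pair above a threshold (0.9 here is an assumption, not the notebook's actual cutoff):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)
df = pd.DataFrame({
    'a': a,
    'b': a * 2 + rng.normal(scale=0.01, size=100),  # near-duplicate of 'a'
    'c': rng.normal(size=100),
})

# Upper triangle of |corr| so each pair is inspected once.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df = df.drop(columns=to_drop)
print(list(df.columns))  # ['a', 'c']
```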

Insights:

E. Make all relevant modifications on the data using both functional/logical reasoning/assumptions. [2 Marks]

There are no columns with zero variance. Since we have already removed the columns where all rows share the same value, we do not expect to see any column with zero variance or zero standard deviation.

From here on, Pass will be tracked as 1 and Fail will be tracked as 0.
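That encoding could be applied with a simple map; the raw label values shown are hypothetical (the original CSV may use a different encoding, e.g. -1/1, in which case the mapping dict changes accordingly):

```python
import pandas as pd

# Hypothetical raw labels; adjust the dict to the dataset's actual encoding.
y_raw = pd.Series(['Pass', 'Fail', 'Pass', 'Pass'])
y = y_raw.map({'Pass': 1, 'Fail': 0})
print(y.tolist())  # [1, 0, 1, 1]
```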

Data cleansing Summary:

3. Data analysis & visualisation: [5 Marks]

A. Perform a detailed univariate Analysis with appropriate detailed comments after each analysis. [2 Marks]

The number of Pass cases is far higher than the number of Fail cases. We need to handle this class imbalance later in the exercise.

Let us now look at the distribution of each feature individually.

Insights:

B. Perform bivariate and multivariate analysis with appropriate detailed comments after each analysis. [3 Marks]

Insights:

Insights:

Insights:

4. Data pre-processing: [10 Marks]

A. Segregate predictors vs target attributes. [2 Marks]

B. Check for target balancing and fix it if found imbalanced. [3 Marks]
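One simple balancing option is to upsample the minority class with `sklearn.utils.resample` (SMOTE from imbalanced-learn is a common alternative); the toy frame below is illustrative:

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({'f': range(10), 'target': [1] * 8 + [0] * 2})
majority = df[df['target'] == 1]
minority = df[df['target'] == 0]

# Sample the minority class with replacement until it matches the majority.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(balanced['target'].value_counts().to_dict())  # {1: 8, 0: 8}
```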

C. Perform train-test split and standardise the data or vice versa if required. [3 Marks]
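A sketch of the split-then-scale order, fitting the scaler on the train split only so no test statistics leak into training; the shapes and `test_size` are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(40, dtype=float).reshape(20, 2)
y = np.array([0, 1] * 10)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)

# Fit on train only; reuse the same fitted scaler on test.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
print(X_train_s.shape, X_test_s.shape)  # (15, 2) (5, 2)
```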

D. Check if the train and test data have similar statistical characteristics when compared with original data. [2 Marks]

Insight:

- The box plots show a similar data distribution across all 4 columns.
- The outliers also appear to be distributed similarly across the train, test, and original data.
- Q1, Q2, and Q3 are all similar across the train, test, and original data.

Insight:

- The box plots show a similar data distribution across all columns.
- The outliers also appear to be distributed similarly across the train, test, and original data.
- Q1, Q2, and Q3 are all similar across the train, test, and original data for all columns.

5. Model training, testing and tuning: [20 Marks]

A. Use any Supervised Learning technique to train a model. [2 Marks]

Support Vector Classification model

Insight:

Insights on the confusion matrix

Insights on Test data prediction:

Precision: of all the cases predicted as a given class, the fraction that truly belong to that class. Recall (sensitivity, or TPR): of all the actual cases of a class, the fraction we correctly identified.

The model correctly predicted Pass cases 67% of the time.
The model correctly predicted Fail cases 100% of the time.

The F1 scores show precision and recall balancing at 80% for Pass and 65% for Fail. The overall test accuracy is 75%.
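These figures come from the confusion matrix and classification report; a toy sketch with hypothetical labels (1 = Pass, 0 = Fail) shows how scikit-learn derives them (the numbers here are not the notebook's actual results):

```python
from sklearn.metrics import (classification_report, confusion_matrix,
                             precision_score)

# Hypothetical labels purely to illustrate the metric definitions.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]

print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=['Fail', 'Pass']))
# Precision for Pass: 3 true positives out of 4 predicted positives = 0.75
print(precision_score(y_true, y_pred))  # 0.75
```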

B. Use cross validation techniques. [3 Marks]

Hint: Use all CV techniques that you have learnt in the course.

Showing various CV techniques

Using GridSearchCV

Using RandomizedSearchCV

Using cross_val_score
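The three techniques named above can be sketched on synthetic data; the parameter grids below are illustrative, not the notebook's actual grids:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     cross_val_score)
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# GridSearchCV: exhaustive search over every grid combination.
grid = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, cv=3).fit(X, y)

# RandomizedSearchCV: samples settings rather than enumerating them all.
rand = RandomizedSearchCV(SVC(), {'C': [0.1, 1, 10],
                                  'gamma': ['scale', 0.01]},
                          n_iter=4, cv=3, random_state=0).fit(X, y)

# cross_val_score: k-fold scores for one fixed configuration.
scores = cross_val_score(SVC(C=grid.best_params_['C']), X, y, cv=5)
print(grid.best_params_, len(scores))
```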

C. Apply hyper-parameter tuning techniques to get the best accuracy. [3 Marks]

Suggestion: Use all possible hyper parameter combinations to extract the best accuracies.

Best hyperparameters: {'gamma': 0.01, 'C': 10}, with a test accuracy of 99.83%.

Insights on the confusion matrix

Insights on Test data prediction:

Precision: of all the cases predicted as a given class, the fraction that truly belong to that class. Recall (sensitivity, or TPR): of all the actual cases of a class, the fraction we correctly identified.

The model correctly predicted Pass cases 100% of the time.
The model correctly predicted Fail cases 100% of the time.

The F1 scores show precision and recall balancing at 100% for both Pass and Fail. The overall test accuracy is 100%.

D. Use any other technique/method which can enhance the model performance. [4 Marks]

Hint: Dimensionality reduction, attribute removal, standardisation/normalisation, target balancing etc.
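As one example of the hint's suggestions, a PCA sketch on synthetic data; the 95% explained-variance threshold is an assumption, not the notebook's actual setting:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

# A float n_components keeps just enough components to explain
# that fraction of the total variance.
pca = PCA(n_components=0.95, random_state=0)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)
```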

Insights:

Insights:

Logistic Regression model

With just 125 features we are able to build a model with a test accuracy of 89.42%. The test accuracy is also close to the train accuracy.

E. Display and explain the classification report in detail. [3 Marks]

Insights on the confusion matrix

Insights on Test data prediction:

Precision: of all the cases predicted as a given class, the fraction that truly belong to that class. Recall (sensitivity, or TPR): of all the actual cases of a class, the fraction we correctly identified.

The model correctly predicted Pass cases 93% of the time.
The model correctly predicted Fail cases 86% of the time.

The F1 scores show precision and recall balancing at 89% for Pass and 90% for Fail. The overall test accuracy is 89%.

F. Apply the above steps for all possible models that you have learnt so far. [5 Marks]

Above, we saw how the logistic regression model performed with all the applied techniques. We will now evaluate the remaining models on the PCA-reduced data.

SVC model

The best parameters for rbf_model are {'g': 0.01, 'c': 10}, with an accuracy of 99.66%.

Insights on Test data prediction:

Precision: of all the cases predicted as a given class, the fraction that truly belong to that class. Recall (sensitivity, or TPR): of all the actual cases of a class, the fraction we correctly identified.

The model correctly predicted Pass cases 100% of the time.
The model correctly predicted Fail cases 99% of the time.

The F1 scores show precision and recall balancing at 100% for both Pass and Fail. The overall test accuracy is 99.65%.

XGBClassifier model

The best-performing parameters for xgb_clf are gamma=0, learning_rate=0.3, max_depth=6, n_estimators=100, random_state=5, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=5, tree_method='exact'.

Insights on Test data prediction:

Precision: of all the cases predicted as a given class, the fraction that truly belong to that class. Recall (sensitivity, or TPR): of all the actual cases of a class, the fraction we correctly identified.

The model correctly predicted Pass cases 100% of the time.
The model correctly predicted Fail cases 100% of the time.

The F1 scores show precision and recall balancing at 100% for both Pass and Fail. The overall test accuracy is 99.82%, i.e. approximately 100%.

RandomForestClassifier model

KNeighborsClassifier Model

DecisionTreeClassifier model

GradientBoostingClassifier Model

GaussianNB Model

6. Post Training and Conclusion: [5 Marks]

A. Display and compare all the models designed with their train and test accuracies. [1 Marks]

B. Select the final best trained model along with your detailed comments for selecting this model. [1 Marks]

RandomForestClassifier, Support Vector Classifier, and XGBoost Classifier are the best models in this problem for correctly predicting Pass and Fail cases. However, RandomForestClassifier and Support Vector Classifier generally handle high-dimensional data well, whereas XGBoost on high-dimensional data leads to high memory consumption and can potentially cause out-of-memory errors. So XGBoost may not be preferred for productionizing the model and handling larger data.

Between RandomForestClassifier and Support Vector Classifier, RandomForestClassifier suits multi-class problems and mixtures of numerical and categorical features very well. In this problem we are doing binary classification only (Pass or Fail), and all features are numerical. In other words, SVM is sufficient for this problem, and RandomForestClassifier may tend to overfit in this case.

Finally, the Support Vector Classifier can be selected, with 99.65% accuracy on the test dataset and 100% accuracy on the train dataset.

C. Pickle the selected model for future use. [2 Marks]
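A minimal pickling sketch; the file name 'svc_model.pkl' and the synthetic training data are illustrative:

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=0)
model = SVC().fit(X, y)

# Serialise the trained model to disk for future use.
with open('svc_model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Restoring it later gives back an equivalent fitted model.
with open('svc_model.pkl', 'rb') as f:
    restored = pickle.load(f)
```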

D. Write your conclusion on the results. [1 Marks]

Support Vector Machine - Classifier model results:

Insights on the confusion matrix for Train dataset

Insights on the confusion matrix for Test dataset

Insights on Test data prediction:

Precision: of all the cases predicted as a given class, the fraction that truly belong to that class. Recall (sensitivity, or TPR): of all the actual cases of a class, the fraction we correctly identified.

The model correctly predicted Pass cases 100% of the time.
The model correctly predicted Fail cases 99% of the time.

The F1 scores show precision and recall balancing at 100% for both Pass and Fail. The overall test accuracy is 99.65%.

THE END